We deal with the problem of recognizing social roles played by people in an
event. Social roles are governed by human interactions, and form a fundamental
component of human event description. We focus on a weakly supervised setting,
where we are provided different videos belonging to an event class, without
training role labels. Since social roles are described by the interaction
between people in an event, we propose a Conditional Random Field to model the
inter-role interactions, along with person specific social descriptors. We
develop tractable variational inference to simultaneously infer model weights,
as well as role assignment to all people in the videos. We also present a novel
YouTube social roles dataset with ground truth role annotations, and introduce
annotations on a subset of videos from the TRECVID-MED11 [1] event kits for
evaluation purposes. The performance of the model is compared against different
baseline methods on these datasets.


1.  Introduction 

Humans are social animals. Our ability to comprehend human relations is
fundamental to our survival, development and social life. We understand such
relationships in terms of social roles assumed by people, and tend to describe
events using these roles. For instance, we would describe the birthday video in
Fig. 1 as “Parents helping the birthday boy cut a cake”, rather than “Two people
helping another person cut a cake”. Typically, social roles answer semantic
queries like “Who is doing what in an event?”. While the tasks of identifying
the action and detecting the person are widely studied in computer vision, the
problem of role assignment is relatively new and equally interesting.

Social role discovery derives motivation from the field of “Role Theory” [2] in
sociology, which observes that people behave in predictable ways based on their
social roles. This shows that knowing the role of a person can help determine
his/her interactions with the environment and vice-versa. In computer vision,
[13] leveraged the same intuition to build a human activity recognition model.
Also, the knowledge of social roles can help determine the interesting segments
of social event footage [7] and sports videos.

Figure 1. When people interact in an event, they assume event specific social
roles. Social roles act as identities for the individuals and can help us
describe the event in terms of these roles. Role recognition is fundamental in
understanding a human event.

The definition of social roles is event specific, and can sometimes be abstract,
such as people “helping”, “visiting” or “residing” in a nursing home [13],
making role identification a difficult task even for humans. Ideally, we would
like to automatically discover such interaction-based role assignments in any
event. Also, annotating roles is time consuming and needs knowledge of the
event. Recognizing these difficulties, we formulate the problem of social role
discovery in a weakly supervised framework. Given a set of videos belonging to a
social event without training labels for the people in the videos, we group them
into different social roles. The event label acts as the weak annotation in our
setting, restricting the discovered roles to be event specific.

The problem is challenging due to the wide variation in appearance, scale,
location and scene context of a role across different videos, as seen in Fig. 2.
As illustrated in Fig. 1, it is difficult to determine roles by observing people
individually. Rather, social role discovery is an attempt to identify people
based on their interactions in an event. Modeling such interactions in the
absence of role labels during training is an additional challenge.

In order to solve this problem of weakly supervised role assignment, we propose
a Conditional Random Field (CRF) to capture inter-role interaction cues, and
develop a tractable variational inference procedure to jointly learn role labels
as well as model weights. Further, to evaluate the model performance, we
introduce a novel YouTube social roles dataset in Sec. 5.1, accompanied by event
specific ground truth role annotations for the people in the videos. Note that
the role labels are only used for model evaluation and not for training. We also
provide role annotations for a subset of videos from two events of the
TRECVID-MED11 [1] event kits, and test our model performance on these videos.
Experiments on these datasets show that our method achieves encouraging
performance in weakly supervised social role assignment.

Figure 2. Sample frames from different events in the YouTube Social Roles
dataset are shown with ground truth role annotations used for evaluation. The
different roles in each event are marked by the colors noted in the last column.
The huge variation in appearance, location, scale and scene context for a role
across different videos can be seen.

2.  Related  Work 

Socially aware video and image analysis. Recent works on social network
construction and interaction understanding are relevant to our work on social
role recognition. [2: ] associates people in a video using face recognition and
track matching. [4, 5] cluster people in a movie into adversarial groups. [: ]
uses scene context and visual concept attributes to build a social relation
network. [23] also builds a social role network based on the co-occurrence of
movie characters in different scenes. These works do not group people across
different videos, but consider people within one movie. [22] uses appearance
features to predict the relationship between people by training on images with
weak relationship labels, while [19] performs occupation classification based on
clothing and context in human images. [20] studied the problem of face
recognition in social context.

Social Interaction in Action Recognition. Another related line of work has been
the use of social interaction to aid group action recognition [14, 3, 6]. [14]
explicitly models human interaction, while [3] uses features of people in
spatio-temporal vicinity to detect group activities and jointly track multiple
people. [18] also uses social grouping to help multi-target tracking. [10] uses
social context in group photos to make better predictions of human attributes
and scene semantics. [ ] recognizes group social activities through attribute
learning. [17] develops interaction features based on facial orientation to
recognize activities like hand-shaking. Similarly, [16] also models facial
attention. Although the above works capture social interactions in some form,
they do not explicitly identify the roles assumed by people during a social
event.

Role Recognition. Recently, [7, 1 ] used social roles to predict group
activities. [7] found face attention patterns in first person videos to detect
interaction activities like monologue, discussion and dialogue. They clustered
faces in training videos based on attention patterns, and represented frame
sequences by histograms of cluster occurrences. [13] predicted role labels like
“defender” and “attacker” in sports videos to identify group activities. They
used training labels to learn role assignments based on spatio-temporal
interaction between players. However, in our work we are not provided role
annotations, and we wish to discover interaction-based roles automatically by
studying different instances of an event. We also use richer interaction
features.

3.  Our  Approach 

We define social role discovery as a weakly supervised problem, where the
training role labels for the people in the videos are not available. We are only
provided the event label for each video, and the number of roles to be
discovered in an event. We assume that every video is pre-processed to obtain
individual human tracks similar to [6, 13].

Social roles are not only decided by person specific descriptors, but also by
the interaction between people. Hence, any model used to discover social roles
should be capable of incorporating this information. However, interaction in an
event is usually restricted to a small set of roles. In our approach, every
event has a reference role, and the interaction of any person with this
reference role is most significant. To understand this, consider a birthday,
where the important interactions mostly involve the “birthday person”. With this
assumption, it is sufficient to model the interaction of any person only with
the reference role. This is a realistic simplification, enabling us to perform
tractable inference as shown in Sec. 4. One instance of the reference role is
assumed to be present in every video belonging to the event class. We refer to
the other roles as secondary roles.

3.1.  Model  Formulation 

We present a CRF model which accounts for the reference role interaction with
other roles in a video. An overview of our approach is shown in Fig. 3, along
with the factor graph of our model. As illustrated, to capture person specific
social cues, we extract unary features ($\Psi_u$) from each human track,
describing spatio-temporal activity, human appearance and human-object
interaction. Similarly, to represent interaction based social cues, pairwise
features ($\Psi_p$) describing proxemic touch codes and spatial proximity are
extracted. Our CRF model uses these features to perform weakly supervised social
role recognition.

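Let $\mathbb{P}_v$ be the set of people in a video $v$ and $s_i^v$ be the social
role assigned to a person $p_i^v \in \mathbb{P}_v$. We want to assign social
roles, and jointly learn model weights by maximizing the log likelihood of the
CRF shown in Eq. 1.

$$\underset{s^E,\,\alpha,\,\beta}{\operatorname{argmax}} \; \sum_v \Big\{ \sum_i \alpha \cdot \Psi_u(p_i^v, s_i^v) + \sum_{p_j^v \neq p_m^v} \beta \cdot \Psi_p(p_m^v, p_j^v, s_j^v) - Z_v \Big\} \qquad (1)$$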

Figure 3. (a) The features extracted by our model are illustrated on a sample
birthday video frame. Unary features are represented in blue, while the pairwise
features are shown in red. (b) The factor graph of our CRF model is shown. The
observed variables are shaded; $m$ is the index of the reference role in the
video $v$. The model variables are as defined in Sec. 3.1.

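where $m_E$ denotes the reference role in the event $E$, and $p_m^v$ the person
holding the reference role in $v$. The model potentials are defined as

$$\alpha \cdot \Psi_u(p_i^v, s_i^v) = \sum_{s} \alpha_s \cdot \mathbb{1}(s = s_i^v)\, \Psi_u(p_i^v), \qquad (2)$$

$$\beta \cdot \Psi_p(p_m^v, p_j^v, s_j^v) = \sum_{s \neq m_E} \beta_s \cdot \mathbb{1}(s = s_j^v)\, \Psi_p(p_m^v, p_j^v).$$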

In Eq. 1, $s^E$ is the complete social role assignment to all people in the
event, and $Z_v$ is the log-partition function for the video $v$.
$\Sigma_\alpha$ and $\Sigma_\beta$ are the covariances of the Gaussian priors on
$\alpha$ and $\beta$ respectively. Note that the model only considers
interaction of different roles with the reference role, in accordance with our
assumption, and every video is assumed to contain one person playing this
reference role. $\alpha$ and $\beta$ are the unary and pairwise weights to be
learnt, respectively. A factor graph of the model is shown in Fig. 3.
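
As a rough illustration of how the potentials in Eq. 2 combine for one video, the following Python sketch computes the unnormalized score (the bracketed term of Eq. 1 without $Z_v$) for a given role assignment. The function name, the list-based data layout and the convention that role index 0 is the reference role are assumptions made for this example only; they are not taken from the paper.

```python
def video_score(unary_feats, pair_feats, roles, ref_idx, alpha, beta, ref_role=0):
    """Unnormalized log-score of one role assignment in a single video.

    unary_feats[i]   : unary feature vector of person i (Psi_u)
    pair_feats[m][j] : pairwise feature vector of the pair (m, j) (Psi_p)
    roles[i]         : role index assigned to person i
    ref_idx          : index of the person holding the reference role
    alpha[s], beta[s]: per-role weight vectors; beta[ref_role] is unused
    """
    def dot(w, x):
        return sum(a * b for a, b in zip(w, x))

    # Exactly one person may hold the reference role (assumption of Sec. 3).
    assert roles[ref_idx] == ref_role
    assert sum(1 for s in roles if s == ref_role) == 1

    score = 0.0
    for i, s in enumerate(roles):
        # Unary term: the indicator in Eq. 2 selects the weights of role s.
        score += dot(alpha[s], unary_feats[i])
        if i != ref_idx:
            # Pairwise term: only interactions with the reference person count.
            score += dot(beta[s], pair_feats[ref_idx][i])
    return score


# Toy usage: two people, two roles, 2-d unary and 1-d pairwise features.
unary = [[1.0, 0.0], [0.0, 1.0]]
pair = [[None, [2.0]], [[2.0], None]]
alpha = [[1.0, 0.0], [0.0, 3.0]]
beta = [None, [0.5]]
print(video_score(unary, pair, roles=[0, 1], ref_idx=0, alpha=alpha, beta=beta))
```

Because only pairs involving the reference person contribute, the score of a video with $n$ people has $n$ unary terms and $n-1$ pairwise terms.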

3.2.  Unary  Features 

The unary feature $\Psi_u$ captures role specific social cues extracted from
human tracks, and their interaction with the event environment. $\Psi_u$ can be
expanded into four components as shown below.

Histogram of Gradient Feature $\Psi_u^{HoG}$: A bag of densely computed HoG3D
[1 ] words of dimension 1429 along the human track is used as a low-level
feature to capture the individual actions.

Spatio-Temporal Feature $\Psi_u^{ST}$: A person’s movement in an event is
another useful cue regarding his/her role. For example, the “bride” often walks
down the aisle in a church wedding. The human motion between two frames is
binned along 8 directions to form a trajectory feature similar to [12]. These
features are normalized across different people in a video to partly account for
camera motion.
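
The direction binning described above can be sketched as follows. This is only an illustration: the track representation (a list of per-frame box centers), the equal-width angular bins and the use of mean subtraction as the across-people normalization are assumptions for the example, since the exact procedure follows [12] and is not spelled out here.

```python
import math


def direction_histogram(track, n_bins=8):
    """Count frame-to-frame displacements of a track in n_bins directions.

    track is a list of (x, y) box centers, one per frame (assumed layout).
    """
    hist = [0.0] * n_bins
    bin_width = 2 * math.pi / n_bins
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            continue  # no motion between these frames
        angle = math.atan2(dy, dx) % (2 * math.pi)
        hist[int(angle / bin_width) % n_bins] += 1.0
    return hist


def normalize_across_people(hists):
    """Subtract the per-bin mean over all people in a video.

    Motion shared by everyone (e.g. a camera pan) is reduced this way;
    mean subtraction is an assumed choice of normalization.
    """
    n = len(hists)
    mean = [sum(h[b] for h in hists) / n for b in range(len(hists[0]))]
    return [[h[b] - mean[b] for b in range(len(h))] for h in hists]


# Toy usage: one person moves right twice and then up once.
h = direction_histogram([(0, 0), (1, 0), (2, 0), (2, 1)])
print(h)
```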

Object Interaction Feature: The interaction of a person with the event
environment plays a key role in determining his/her role; the “birthday person”
cutting a “cake” and the “function host” talking at the “lectern” are
representative examples. In the current work, we extract interaction features
corresponding to only these two objects in the respective events. [8] is used to
obtain specific object detection scores in a video. These scores are spatially
pooled similar to [1' ] in the periphery of the person’s bounding box and
averaged across multiple frames to form an object interaction feature of
dimension 48 for every event object.

Social Feature: These features capture two important social aspects of a person,
representing gender and clothing. Such cues are important in events like
weddings. This would also capture the gender bias in certain roles like
“brides”. We first use [27] to detect faces, and obtain scores for gender
classification (see footnote 1). The scores are averaged across frames to form
the gender feature. The clothing of a person is represented by an RGB color
histogram with 32 bins.

3.3.  Pairwise  Interaction  Features 

Human interaction forms an important basis for social role definitions. For
instance, the “parent” in a birthday is distinguished from “guests” by their
interaction with the “birthday person”. Similarly, “bride-groom” and
“instructor-student” interactions separate the respective roles from others.
These interactions are recorded by the pairwise feature $\Psi_p$, composed of
two components as shown below.

Proxemic Interaction Feature $\Psi_p^{prox}$: The proxemic interaction of two
people provides interesting insights regarding the relation between roles in an
event, such as the touch-code between a “parent” and the “birthday child”. The
use of proxemics for describing human-human relations was introduced in [24],
where the authors classify proxemics between two people into 6 classes with 20
models. Proxemics are also referred to as touch-codes, indicating the way people
touch each other. For every pair of humans in a video, we use all 20 models from
[24] to find proxemic scores in different frames. The scores are normalized
across all human pairs in a given video and split into 16 bins for every model,
to form our final proxemic descriptor. The scores are set to a minimum value if
a pair of people are never sufficiently close to each other.
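
One possible reading of the descriptor construction above is sketched below in Python: per-frame scores of each proxemic model are min-max normalized over all pairs in the video and then histogrammed into 16 bins per model. The min-max scaling, the frequency-normalized histograms and the nested-list data layout are assumptions for illustration; the text only states that scores are normalized across pairs and split into 16 bins.

```python
def proxemic_descriptor(scores, n_bins=16):
    """Build one descriptor per person pair from per-frame model scores.

    scores[p][m] is the non-empty list of per-frame scores of proxemic
    model m for pair p (assumed layout). The result concatenates, for each
    model, an n_bins histogram of that pair's normalized scores.
    """
    n_models = len(scores[0])
    descriptors = [[] for _ in scores]
    for m in range(n_models):
        # Normalize model m across all pairs of the video (min-max, assumed).
        all_vals = [s for pair in scores for s in pair[m]]
        lo, hi = min(all_vals), max(all_vals)
        span = (hi - lo) or 1.0  # avoid division by zero for constant scores
        for p, pair in enumerate(scores):
            hist = [0.0] * n_bins
            for s in pair[m]:
                b = min(int((s - lo) / span * n_bins), n_bins - 1)
                hist[b] += 1.0 / len(pair[m])
            descriptors[p].extend(hist)
    return descriptors


# Toy usage: two pairs, one model, two frames each.
d = proxemic_descriptor([[[0.0, 1.0]], [[0.5, 0.5]]])
print(len(d[0]), d[1][8])
```

With the 20 models of [24] and 16 bins each, this layout would give a 320-dimensional vector per pair.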

Spatio-Temporal Interaction Feature $\Psi_p^{ST}$: The spatial separation of
people across time is a simple but powerful measure of human interaction in a
video. For instance, the “bride” and “groom” are always near each other in a
wedding, while the “groomsmen” are farther away from the “bride”. The spatial
distances between a pair, normalized by bounding box dimensions, at different
time instants are used.

1 We use software from http://cmp.felk.cvut.cz/ fisarond/demo/

4.  Inference 

The difficulty of solving Eq. 1 arises from the correlation between different
social roles and the coupling introduced by $Z_v$. [26] proposed a mean field
approximation to solve Conditional Topic Random Fields, with simple chain
connected CRFs and CRFs without interaction potentials. Along similar lines, we
develop a variational inference method to find an approximate solution for our
graphical model. We show that the simplifying assumption of interactions being
restricted to the reference role helps us perform tractable inference as a part
of the optimization procedure. We also introduce a variational approximation to
the social role probability distribution in a video, with similar dependencies
as the original model.

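We formulate the variational approximation $q$ of the model distribution as
shown in Eq. 3, where $s^v$ denotes the role assignment to all people in the
video $v$.

$$q(\alpha, \beta, s^E \mid \lambda_\alpha, \sigma_\alpha^2, \lambda_\beta, \sigma_\beta^2, \phi, \theta) = \prod_j q(\alpha_j \mid \lambda_{\alpha_j}, \sigma_{\alpha_j}^2) \prod_k q(\beta_k \mid \lambda_{\beta_k}, \sigma_{\beta_k}^2) \prod_v q(s^v \mid \phi_v, \theta_v) \qquad (3)$$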

The distributions over $\alpha$ and $\beta$ are approximated by univariate
normal distributions with means given by $\lambda_\alpha$, $\lambda_\beta$ and
variances $\sigma_\alpha^2$, $\sigma_\beta^2$. $\phi_v$ is a factor giving the
probability of a person being assigned the reference role in the video.
$\theta_v$ is a set of $|\mathbb{P}_v|$ factors, where $\theta_v^{(i)}$ is the
secondary role probability matrix for other people in the video, when $p_i^v$ is
assigned the reference role. $\phi$, $\theta$ are formally defined in Eq. 4.
This variational approximation of the social role probability retains the
dependencies in our original structure. It represents one predominant reference
role, with secondary role assignments dependent on this reference role.

$$\phi_v(p_i^v) = P(s_i^v = m_E) \qquad (4)$$

$$\theta_v^{(i)}(p_j^v, s) = P(s_j^v = s \mid s_i^v = m_E), \quad j \neq i,\; s \neq m_E$$

Inference is then carried out through coordinate ascent. In each iteration, the
updates for $\phi$, $\theta$ require inference in the CRF model, with the model
weights fixed. When the model weights are fixed, our graph reduces to a tree for
each individual video, allowing us to perform exact clique-tree inference. The
optimization procedure and update equations for $\phi$, $\theta$, $\lambda$,
$\sigma^2$ are shown in the supplementary document Sec. A, due to space
limitations.
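
To make the tree structure concrete, the sketch below computes the quantities of Eq. 4 exactly for one video when the weights are held fixed: given a candidate reference person, the other people are conditionally independent, so the marginals factorize. This is an illustrative reading and not the update equations of the supplementary document; the function names, the precomputed potential arrays `u` and `w`, and role index 0 as the reference role are all assumptions.

```python
import math


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def infer_roles(u, w, ref=0):
    """Exact inference in the star-shaped model of a single video.

    u[i][s]    : unary potential of person i taking role s
    w[i][j][s] : pairwise potential of person j taking secondary role s
                 when person i holds the reference role
    Returns (phi, theta): phi[i] = P(s_i = ref) and
    theta[i][j][s] = P(s_j = s | s_i = ref) for j != i, s != ref.
    """
    n, n_roles = len(u), len(u[0])
    secondary = [s for s in range(n_roles) if s != ref]
    log_phi, theta = [], []
    for i in range(n):
        total = u[i][ref]
        theta_i = {}
        for j in range(n):
            if j == i:
                continue
            logits = [u[j][s] + w[i][j][s] for s in secondary]
            z = logsumexp(logits)
            total += z  # secondary roles of person j are summed out
            theta_i[j] = {s: math.exp(l - z) for s, l in zip(secondary, logits)}
        log_phi.append(total)
        theta.append(theta_i)
    z = logsumexp(log_phi)
    phi = [math.exp(l - z) for l in log_phi]
    return phi, theta


# Toy usage: two people, two roles, no pairwise preference.
u = [[1.0, 0.0], [0.0, 0.0]]
w = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
phi, theta = infer_roles(u, w)
print(phi, theta[0][1])
```

The cost is linear in the number of roles and quadratic in the number of people per video, which is what makes the reference-role assumption attractive computationally.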

We initialize both $\phi_v$, $\theta_v^{(i)}$ to be uniform for all people in
the event. The $\lambda_\alpha$ are initialized to be the maximally separated
points in the unary feature space for an event $E$. The $\lambda_\beta$ are
similarly initialized from the pairwise interaction feature space.
$\sigma_{\alpha_j}^2$ are initialized to 0.01 or 0.1 based on the variance of
the event unary features. Similarly, $\sigma_{\beta_k}^2$ are initialized to 10
or 0.1 for all events.

In every video $v$, the person with the highest value of $\phi_v$ is assigned
the reference role, forming a reference role cluster. The corresponding
variational probability $\theta_v^{(i)}$ is used to assign secondary roles to
other people in the video. We enforce a lower and upper bound on the number of
people assigned to a secondary role cluster in the event. In practice, the
bounds are set to a 10% range of the smallest and largest ground-truth cluster
sizes in the event. This acts as a loose prior on the number of people in each
role cluster. Integer linear programming is used to satisfy these constraints
during role assignment; the details are shown in the supplementary document
Sec. B due to space limitations.
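
The effect of the cluster-size bounds can be illustrated with a small exhaustive search, which stands in for the integer program of the supplementary document (whose exact formulation is not reproduced here). The objective (total log-probability), the use of one common pair of bounds for all roles and the function name are assumptions for this sketch, and the search is only practical for a handful of people.

```python
import itertools
import math


def assign_with_bounds(probs, lower, upper):
    """Pick secondary roles maximizing total log-probability under size bounds.

    probs[j][k] : probability of person j taking secondary role k
    lower/upper : bounds on the number of people per role (same for all roles)
    Returns a tuple of role indices, or None if no assignment is feasible.
    """
    n, k = len(probs), len(probs[0])
    best, best_score = None, -math.inf
    for labels in itertools.product(range(k), repeat=n):
        counts = [labels.count(r) for r in range(k)]
        if any(c < lower or c > upper for c in counts):
            continue  # violates the cluster-size prior
        score = sum(math.log(probs[j][r] + 1e-12) for j, r in enumerate(labels))
        if score > best_score:
            best, best_score = labels, score
    return best


# Toy usage: everyone prefers role 0, but each role must hold exactly 2 people.
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
print(assign_with_bounds(probs, lower=2, upper=2))
```

In this toy case the two people least confident about role 0 are moved to role 1, which is the behavior the size prior is meant to induce.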

5.  Experiment  and  Results 
5.1.  Datasets 

YouTube Social Roles. Most publicly available video datasets are not suitable
for evaluating the social role assignment task, since they do not cover a good
range of people assuming different roles in specific social events. To evaluate
our method, we collected a set of YouTube videos under 4 social events. The
details of the dataset are shown in Tab. 1. To facilitate easy evaluation, we
annotate every person in our dataset with the relevant social roles. Some videos
have stray individuals who are not annotated with any specific social role and
are referred to as “others”. Again, note that role labels are used only for
evaluation.

Within each social event, there is wide variation in event settings, as seen
from the sample video frames in Fig. 2. Wedding and Birthday videos were chosen
to cover both indoor and outdoor celebrations. Award ceremony includes
graduation functions, presidential award functions as well as corporate events.
Similarly, physical training refers to martial arts, aerobics and other forms of
fitness classes. This diversity in scenarios, with the same underlying
interactions between different roles, is an interesting characteristic of the
dataset, and makes the task challenging.

TRECVID Social Roles. Among publicly available datasets, the TRECVID-MED11 event
kits [1] have two social event classes, birthday and wedding. However, most of
the videos in these kits either have very few characters or show crowd
activities where people cannot be distinguished from each other. Hence, we chose
a smaller subset, covering a reasonable number of people in different roles.
Some videos were cropped to include only the parts showing relevant social
events. Details of the dataset are shown in Tab. 2.

Since human tracking is not the focus of the current work, we obtain human
tracks through the active learning tool from [21]. The dataset, along with the
human tracks and role annotations, would be made publicly available (see
footnote 2).

| Event Name | Social Roles (No. of people per role) | No. of videos | Avg. duration |
|---|---|---|---|
| Birthday Party | birthday child (40), parents (44), friends (71), guests (28) | 40 | 80.84 sec. |
| Catholic Wedding | bride (40), groom (40), priest (38), groomsmen (45), bridesmaids (43), others (8) | 40 | 88.74 sec. |
| Award Function | presenter (40), recipient (309), host (25), distributor (17), others (13) | 40 | 111.13 sec. |
| Physical Training | instructor (36), students (127) | 36 | 50.49 sec. |

Table 1. Details of the YouTube social roles dataset.

| Event Name | Social Roles (No. of people per role) | No. of videos | Avg. duration |
|---|---|---|---|
| Birthday Party | birthday person (34), parent/spouse (40), friends (59), guests (31) | 34 | 44.65 sec. |
| Catholic Wedding | bride (34), groom (34), priest (29), groomsmen (29), bridesmaids (29) | 34 | 72.00 sec. |

Table 2. Details of the TRECVID social roles dataset.

5.2.  Role  Discovery  Results 

In our experiments, we evaluate the model by comparing results with human
annotated roles in each video. Due to the weakly supervised nature of the
problem, we do not have a direct mapping between role clusters and ground-truth
role labels. To facilitate easy comparison with different baselines, the role
clusters obtained from a method are each mapped to one of the human defined
roles, maximizing the total correct role assignments in an event. We present
results on the two datasets from Sec. 5.1 and compare our full model against
different baselines in Tab. 3 and 4. The tables show the total accuracy of role
assignment in an event. The baselines used for comparison are explained below.

• prior: Simple baseline. A random person in each video is assigned the
reference role, and the true prior of secondary roles is used to assign roles to
other people in the video.

• k-means: Simple experiment, where people are clustered using appearance and
spatio-temporal features.

• CRF with $\Psi_u$: To judge the importance of interaction features, we use a
CRF with only unary features, similar to CTRF in [26]. We use the same priors as
our model.

• CRF with $\Psi_{up}$: To demonstrate the gain from explicitly modeling
inter-role interactions rather than using interactions as context, this baseline
concatenates the mean interaction feature of a person with everyone else to the
unary feature, forming $\Psi_{up}$, which is used in the same CRF as before.

• Our model - $\Psi_p^{prox}$: Full model without $\Psi_p^{prox}$.

• Our model - $\Psi_p^{spat}$: Full model without $\Psi_p^{ST}$.

2 https://sites.google.com/site/eevignesh/socialroles

| Method | Birthday | Wedding | Award Function | Physical Training |
|---|---|---|---|---|
| prior | 29.32% | 20.17% | 62.97% | 65.93% |
| k-means cluster | 33.88% | 29.43% | 31.97% | 57.67% |
| CRF with $\Psi_u$ | 38.25% | 39.22% | 69.31% | 76.69% |
| CRF with $\Psi_{up}$ | 41.53% | 38.83% | 77.75% | 77.91% |
| Our model - $\Psi_p^{prox}$ | 43.72% | 36.41% | 79.54% | **82.82%** |
| Our model - $\Psi_p^{spat}$ | 43.72% | 39.32% | 79.80% | 77.91% |
| Our Full Model | **44.81%** | **42.72%** | **83.12%** | **82.82%** |

Table 3. Total role assignment accuracy for the YouTube dataset. The best
performance in each event is marked by bold font.

| Method | Birthday | Wedding |
|---|---|---|
| prior | 28.72% | 21.63% |
| k-means cluster | 29.88% | 34.19% |
| CRF with $\Psi_u$ | 35.98% | 38.71% |
| CRF with $\Psi_{up}$ | 42.07% | 41.94% |
| Our model - $\Psi_p^{prox}$ | 41.46% | 41.29% |
| Our model - $\Psi_p^{spat}$ | 43.90% | 41.29% |
| Our Full Model | **44.51%** | **43.87%** |

Table 4. Total role assignment accuracy for the TRECVID dataset. The best
performance in each event is marked by bold font.

Figure 4. Sample frames from videos are shown, where our full model identified
the correct (a) “bride” (green box), “groom” (red box) roles in wedding and (b)
“presenter” (green box), “recipient” (red box) roles in award function. The same
Hand-Hand touch code is seen to be detected on different instances of the same
role pair. The black and white boxes are the part detections from two different
proxemic models for Hand-Hand touch.
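
The cluster-to-role mapping used to score every method in this section can be sketched as below. The sketch assumes a one-to-one mapping between clusters and roles and searches all permutations, which is feasible for the 2 to 6 roles per event used here; the one-to-one restriction, the function name and the toy labels are assumptions for illustration.

```python
from itertools import permutations


def mapped_accuracy(clusters, truth, n_roles):
    """Accuracy under the best one-to-one mapping of clusters to roles.

    clusters[i] : cluster index assigned to person i by a method
    truth[i]    : ground-truth role index of person i
    """
    # counts[c][t] = number of people in cluster c whose true role is t
    counts = [[0] * n_roles for _ in range(n_roles)]
    for c, t in zip(clusters, truth):
        counts[c][t] += 1
    best = max(
        sum(counts[c][perm[c]] for c in range(n_roles))
        for perm in permutations(range(n_roles))
    )
    return best / len(truth)


# Toy usage: five people, three roles; the best mapping gets 4 of 5 correct.
print(mapped_accuracy([0, 0, 1, 1, 2], [1, 1, 0, 2, 2], n_roles=3))
```

For events with many more roles, the same maximization could be solved with a bipartite matching algorithm instead of enumerating permutations.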

From the results in Tab. 3, we notice that a CRF using $\Psi_u$ outperforms
naive k-means clustering, justifying the use of this representation with our
unary features. Also, the use of interaction as a context feature in
$\Psi_{up}$ is seen to do better than the use of only unary features in most
events. This confirms our belief that human interactions are informative for
role recognition. In particular, we observe a considerable increase for the
award function event, where the interaction between the “recipient” and
“presenter”, as seen in Fig. 4(b), would help distinguish the “presenter” from
other people at the dais. Next, we observe that our full model improves over the
CRF with $\Psi_{up}$ in every event. This demonstrates the value in explicitly
modeling interaction between role pairs, instead of using interaction as a
context feature. For instance, consider a wedding with similar interactions
between a “bride-groom” pair and a “bridesmaid-groomsman” pair. These
interactions lead to the same interaction-context feature for both the “bride”
and the “bridesmaid”. However, our full model would treat them differently, due
to the difference in the other role participating in the interaction, leading to
a richer description.

Our full model using the complete pairwise interaction feature $\Psi_p$ performs
better than the models using only $\Psi_p^{prox}$ or $\Psi_p^{ST}$, showing the
gain from the use of both components. It is interesting to note the considerable
drop in performance for the award function and wedding events in the absence of
$\Psi_p^{prox}$. We observed that the proxemic models corresponding to specific
touch-codes fired consistently across different “bride-groom” and
“presenter-recipient” pairs in wedding and award functions respectively,
distinguishing them from other role pairs in the events. We illustrate this in
Fig. 4.

To analyze the complete role assignment, we look at the confusion matrices in
Fig. 5. The column corresponding to the reference role cluster chosen by our
algorithm is highlighted in each matrix. The average purities of the reference
role clusters are 0.65 and 0.56 in the YouTube and TRECVID datasets
respectively. This demonstrates the ability of our model to isolate the
reference role in each video. We observe that the model is able to cluster the
roles better in the wedding event, as seen in Fig. 5(a), 5(e). This can be
attributed to the strong interaction between the “bride” and “groom”, separating
them from the remaining roles. To study this interaction, we visualize the
marginals of the spatial relationship of different roles with the reference role
(“groom”) cluster in the YouTube wedding dataset, in Fig. 6. The marginals
capture the expected interaction, as explained in the figure. The confusion of
“distributor” with the “recipient” in Fig. 5(c) can be explained by the similar
patterns of interaction between the “recipient” receiving the award from the
“presenter”, and the “distributor” handing out the award to the “presenter”.
“Friends” are difficult to distinguish from “guests” in the TRECVID birthday
dataset, where we observed both roles to exhibit low interaction with the
reference role.
Figure 5. Confusion matrices for different events are shown for the YouTube and
TRECVID Social Roles datasets. The column corresponding to the reference role
cluster chosen by our model is highlighted in each event. (This figure is best
viewed in color.)

Sample results from our full model are shown in Fig. 7 along with typical
failure instances. Most failure cases involved little interaction among people,
as seen in the last column of birthday, wedding and physical training.

In order to evaluate the latent reference role assignment in our model, we
compare performance with a control setting which randomly chooses the reference
role in each video. The average accuracy of role assignment over all events is
seen to drop by 4.82% for the YouTube social roles dataset with this choice of
reference role, justifying the need to model it as a latent variable. In
particular, we observe a large drop of 6.80% for the wedding event, which has
more role classes than the other events, leading to increased randomness in the
choice of reference role in each video.

6.  Conclusion 

We proposed to recognize social roles from human event videos in a weakly
supervised setting, and designed a CRF to model the inter-role interactions
along with person specific unary features. This weak supervision enables our
method to automatically understand the relations between people, and discover
the different roles associated with an event. It further reduces the human
effort involved in observing long video footage to annotate the roles. We showed
considerable performance improvement over different baseline models. As a next
step, our approach can be extended to perform simultaneous event classification
along with role discovery. We also note that our method is not robust to noisy
and fragmented reference role tracking, due to the inherent assumption of one
reference role per video. In the future, we wish to account for such noisy
tracking.
Figure 6. Panels include (c) Bridesmaids and (d) Groomsmen. The groom’s position
is marked by a cross-hair. The “bride” is mostly close to the “groom”;
“groomsmen” and “bridesmaids” are distributed around the groom as expected. The
uncertainty in recognizing the “priest” is reflected by a scattered
distribution.


Figure 7. Sample results from the YouTube social roles dataset are shown, where
each row corresponds to an event. Boxes with solid lines indicate correct role
assignments from our full model, while dashed lines represent faulty
assignments. Different roles are indicated by the same color code as in Fig. 2.
The ground truth role of a person is indicated by the color of the dot on the
person. The last column shows typical failure cases for each event.